Abstract

In this paper we generalize the framework of the feasible descent method (FDM)
to a randomized (R-FDM) and a coordinate-wise random feasible descent method
(RC-FDM) framework. We show that the famous SDCA algorithm for optimizing
the SVM dual problem, or the stochastic coordinate descent method for the
LASSO problem, fits into the framework of RC-FDM. We prove linear convergence
for both R-FDM and RC-FDM under the weak strong convexity assumption.
Moreover, we show that the duality gap converges linearly for RC-FDM,
which implies that the duality gap also converges linearly for SDCA applied to
the SVM dual problem.


1 Introduction 

In this paper we are interested in the following optimization problem 

\[ \min_{x \in X} f(x), \tag{1} \]

where the function $f$ is smooth and convex, and $X \subseteq \mathbb{R}^n$ is a convex set. The Feasible Descent
Method (FDM) 1171191 [TSlI is any algorithm which produces a sequence of points $\{x_k\}_{k=0}^{\infty}$ for which
there exist constants $\beta \ge 0$, $\zeta > 0$ and $\omega_k \ge \bar{\omega} > 0$ such that the following three conditions hold for
every iteration $k$:

\[ x_{k+1} = \mathrm{Proj}_X\left(x_k - \omega_k \nabla f(x_k) + z_k\right), \tag{2} \]

\[ \|z_k\| \le \beta \|x_k - x_{k+1}\|, \tag{3} \]

\[ f(x_{k+1}) \le f(x_k) - \zeta \|x_k - x_{k+1}\|^2, \tag{4} \]

where $\mathrm{Proj}_X(y) := \arg\min_{x \in X} \|x - y\|$ is the projection of $y$ onto $X$.
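As an illustration of conditions (2)-(4), the following minimal sketch runs projected gradient descent (one of the methods covered by FDM) with $z_k = 0$ on a toy problem and checks the sufficient decrease condition (4) at every iteration. The quadratic objective, the box $[0,1]^2$, the names `Q`, `B_VEC`, `projected_gradient` and the choice $\omega_k = 1/L$, $\zeta = L/2$ are assumptions made for this example only; they are not taken from the paper.

```python
# Toy problem (an assumption of this sketch): f(x) = 0.5 x^T Q x - b^T x on the box [0, 1]^2.
Q = [[2.0, 1.0], [1.0, 2.0]]
B_VEC = [1.0, 3.0]
LIPSCHITZ = 3.0  # largest eigenvalue of Q, i.e. the Lipschitz constant of the gradient


def f(x):
    quad = sum(x[i] * Q[i][j] * x[j] for i in range(2) for j in range(2))
    return 0.5 * quad - sum(bi * xi for bi, xi in zip(B_VEC, x))


def grad(x):
    return [sum(Q[i][j] * x[j] for j in range(2)) - B_VEC[i] for i in range(2)]


def project_box(y, lo=0.0, hi=1.0):
    # The Euclidean projection onto a box is a coordinate-wise clip.
    return [min(max(v, lo), hi) for v in y]


def projected_gradient(x0, iters=50):
    omega = 1.0 / LIPSCHITZ   # constant step size omega_k
    zeta = LIPSCHITZ / 2.0    # descent constant that is valid for omega = 1/L
    x = list(x0)
    descent_ok = True
    for _ in range(iters):
        g = grad(x)
        # Update (2) with z_k = 0.
        x_new = project_box([xi - omega * gi for xi, gi in zip(x, g)])
        step_sq = sum((a - c) ** 2 for a, c in zip(x, x_new))
        # Condition (4): decrease proportional to the squared step length.
        descent_ok = descent_ok and f(x_new) <= f(x) - zeta * step_sq + 1e-12
        x = x_new
    return x, descent_ok
```

Since $z_k = 0$, condition (3) holds trivially with $\beta = 0$ in this sketch; only (4) needs to be checked numerically.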

As was shown in Q, many first order algorithms, including steepest descent, the gradient projection
algorithm, the extra gradient method, the proximal minimization algorithm and the cyclic coordinate
descent method, fit into the framework of FDM. However, randomized first order algorithms are
becoming more and more popular nowadays, and the following question naturally arises:
“Can the framework of FDM be extended to a randomized setting?”

In this paper we give an affirmative answer to this question: we show that a randomized version
of FDM can indeed be formulated, and that, for example, the inexact gradient projection
algorithm (when the gradient is corrupted with random noise) and the stochastic coordinate descent
method fit into this new framework.

1.1 Assumptions and Notations 

In this section we state the assumptions and introduce the notation that will be used in this paper. 

The first assumption we make is that the function $f$ enjoys weak strong convexity, which is captured
by the following.

Assumption 1. We assume that there exists a positive vector $w \in \mathbb{R}^n_{++}$ such that the function $f(x)$
satisfies the weak strong convexity property on the set $X$, which is defined as

\[ f(x) - f^* \ge \kappa_f \|x - \bar{x}\|_W^2, \quad \forall x \in X, \tag{5} \]

where $f^* = \min_{x \in X} f(x)$, $\bar{x} = \arg\min_{y \in X : f(y) = f^*} \|x - y\|_W$, $\|x\|_W = \left(\sum_{i=1}^n w_i (x^{(i)})^2\right)^{1/2}$,
$W = \mathrm{diag}(w)$, and $\kappa_f > 0$.

Let us remark that if $f$ is smooth and has a Lipschitz continuous gradient, then Assumption 1 is
weaker than the strong convexity assumption or the global error bound property 0. 

The second assumption we make regards the smoothness of $f$, and is defined precisely as follows.

Assumption 2. We assume that $f(x)$ has a coordinate-wise Lipschitz continuous gradient with
constants $L_i$, i.e. $\forall x \in X$ and $\forall \delta \in \mathbb{R}$ such that $x + \delta e_i \in X$ the following inequality holds

\[ |\nabla_i f(x) - \nabla_i f(x + \delta e_i)| \le L_i |\delta|, \tag{6} \]

where $e_i$ denotes the $i$-th column of the identity matrix $I \in \mathbb{R}^{n \times n}$.

As it was shown in IfT^, Assumption 2 implies that the function $f(x)$ has a Lipschitz continuous
gradient with Lipschitz constant $L_f^W > 0$ with respect to the norm $\|\cdot\|_W$, i.e. $\forall x, y \in X$ we have

\[ \|\nabla f(x) - \nabla f(y)\|_W^* \le L_f^W \|x - y\|_W, \tag{7} \]

where $\|x\|_W^* = \left(\sum_{i=1}^n \frac{1}{w_i} (x^{(i)})^2\right)^{1/2}$ is the dual norm to $\|\cdot\|_W$. Moreover, it was shown in ifT^ that

\[ L_f^W \le \sum_{i=1}^n \frac{L_i}{w_i}. \]

Let us define the projection operator onto the set $X$, with respect to the norm $\|\cdot\|_W$, as follows

\[ \mathrm{Proj}_X^W(x) = \arg\min_{y \in X} \|x - y\|_W^2 = \arg\min_{y \in X} \sum_{i=1}^n w_i (x^{(i)} - y^{(i)})^2, \tag{8} \]

where $x^{(i)}$ denotes the $i$-th coordinate of the vector $x$.

1.2 Applications 

In this section we discuss several problems that arise in the optimization and machine learning
literature, which fit into the FDM framework that we analyze in this paper. We also provide details
showing that, for each problem, the objective function satisfies the assumptions in Section 1.1. (A
discussion on the value of the weak strong convexity parameter $\kappa_f$ will be given in Section 4.)

The dual of SVM. Consider the classical linear SVM problem. The goal is, given $n$ training points
$(a_i, y_i)$, where $a_i \in \mathbb{R}^d$ are the features for point $i$ and $y_i \in \{-1, +1\}$ is its label, to find $w \in \mathbb{R}^d$
such that the regularized empirical loss function is minimized, i.e., one solves the following
optimization problem

\[ \min_{w \in \mathbb{R}^d} \left\{ P(w) := \frac{1}{n} \sum_{i=1}^n \ell_i(w^T a_i) + \frac{\lambda}{2} \|w\|^2 \right\}, \tag{9} \]

where $\lambda > 0$ is a regularization parameter, and, in the case of SVM, the function $\ell_i(w^T a_i) =
\max\{0, 1 - y_i w^T a_i\}$ is the hinge loss. Clearly, the objective function (9) is not smooth. However,
one can formulate the dual ll^ fT4l[T6ll

\[ \min_{x \in [0,1]^n} \left\{ f(x) := \frac{1}{2 \lambda n^2} x^T Q x - \frac{1}{n} \mathbf{1}^T x \right\}, \tag{10} \]

where $Q_{ij} = y_i y_j \langle a_i, a_j \rangle$, and $\mathbf{1}$ denotes the vector of all ones, which is smooth.

Lasso problem and least squares problem. Consider the following optimization problem

\[ \min_{x \in \mathbb{R}^n} g(x) + \lambda \|x\|_1, \tag{11} \]

where $\lambda > 0$ and $g(x)$ is a smooth function with the special structure $g(x) = h(Ax) + q^T x$, where
$A \in \mathbb{R}^{m \times n}$ is some data matrix, $q \in \mathbb{R}^n$ is some vector and $h$ is a strongly convex function. It
is a simple exercise to show that, if we double the dimension of $x$ to $[x^+; x^-]$, we can replace the
term $\lambda \|x\|_1$ in (11) with $\lambda \mathbf{1}^T x^+ + \lambda \mathbf{1}^T x^-$ and impose the constraints $x^+, x^- \ge 0$. Then the Lasso
problem (11) can be reformulated as a smooth optimization problem with simple box constraints.
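The variable splitting described above can be sketched in a few lines. The helper names `split_pos_neg`, `l1_norm` and `smooth_penalty` are hypothetical and introduced only for this illustration; the sketch shows that, for the natural split $x = x^+ - x^-$ with non-negative parts, the linear term $\lambda \mathbf{1}^T x^+ + \lambda \mathbf{1}^T x^-$ reproduces $\lambda \|x\|_1$.

```python
def split_pos_neg(x):
    # Split x into its non-negative parts so that x = x_plus - x_minus.
    x_plus = [max(v, 0.0) for v in x]
    x_minus = [max(-v, 0.0) for v in x]
    return x_plus, x_minus


def l1_norm(x):
    return sum(abs(v) for v in x)


def smooth_penalty(x_plus, x_minus, lam):
    # lam * 1^T x_plus + lam * 1^T x_minus: linear, hence smooth, in the doubled variable.
    return lam * (sum(x_plus) + sum(x_minus))
```

The doubled variable $[x^+; x^-]$ is then constrained only by the simple bounds $x^+, x^- \ge 0$, which is what allows coordinate-wise projections to be used.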

$\ell_2$ regularized empirical loss minimization. Many machine learning problems have the following
structure

\[ \min_x f(x) = \frac{1}{n} \sum_{i=1}^n \ell_i(a_i^T x) + \frac{\lambda}{2} \|x\|^2, \]

where $\lambda > 0$ is a regularization parameter and $\ell_i$ is a loss function. Because we assume that $f$
must be smooth, the following commonly used loss functions fit our assumptions: the logistic loss
function $\ell_i(a_i^T x) = \log(1 + \exp(-y_i a_i^T x))$; the squared loss function $\ell_i(a_i^T x) = (y_i - a_i^T x)^2$; and
the squared hinge loss function $\ell_i(a_i^T x) = (\max\{0, 1 - y_i a_i^T x\})^2$.

1.3 Related work 

Luo and Tseng JT) are among the first to establish asymptotic linear convergence for a non-strongly 
convex problem under the local error bound property. They consider a class of feasible descent 
methods (which includes e.g. the cyclic coordinate descent method). The error bound measures 
how close the current solution is to the optimal solution set with respect to the projected gradient. 
Recently, ifTSl proved that the feasible descent method enjoys a linear convergence rate (from the 
beginning, rather than only locally) under the global error bound property. Considering the class 
of smooth constrained optimization problems with the global error bound property, UllIIol showed 
a linear convergence rate for the parallel version of the stochastic coordinate descent method. In 
a the authors analyzed the asynchronous stochastic coordinate descent method (SCDM) under 
the weak strong convexity assumption. Very recently, a showed that, if the objective function is 
smooth, then the class of problems with the global error bound property is a subset of the class of 
problems with the weak strong convexity property. 

1.4 Contributions 

In this section we list the most important contributions of this paper (not in order of their
significance):

• Randomized and Randomized Coordinate Feasible Descent Methods. We extend the
well known framework of Feasible Descent Methods (FDM) IT] to randomized and
randomized coordinate FDM and show that the SCDM algorithm fits into our newly proposed
framework.

• Linear Convergence Rate. We show that any stochastic or deterministic algorithm which
fits our Randomized FDM (R-FDM) or Randomized Coordinate FDM (RC-FDM) framework
and satisfies our previously stated assumptions converges linearly in expectation.

• Linear Convergence of the Duality Gap for SDCA for SVM. As a consequence of our 
analysis, we show that when SDCA is applied to the dual of the SVM problem, the duality 
gap converges linearly. 

1.5 Paper Outline

In Section 2 we derive the Randomized (R-FDM) and the Randomized Coordinate (RC-FDM) Feasible
Descent Method. In Section 3 we derive the convergence rate for any method which fits into
the R-FDM or RC-FDM framework and we compare our results with those in 13 for SCDM. In
Section 4 we briefly review the global error bound property and using the result in ||3 we compare
our convergence results with lITSl. In Section 5 we show that the duality gap converges linearly for
SDCA applied to the dual of the SVM problem, and in Section 6 we present a brief summary.

2 Randomized and Randomized Coordinate Feasible Descent Method 

The framework of Feasible Descent Methods (FDM) broadly covers many algorithms that use first-order
information Q including gradient descent, cyclic coordinate descent and also the inexact
gradient descent algorithm. We generalize the classical FDM framework to a randomized setting,
which we call the Randomized Feasible Descent Method (R-FDM). To the best of our knowledge
this is the first time such a framework has been considered and that a global linear convergence rate
has been established under Assumptions 1 and 2. Further, we also show that the popular minibatch
stochastic coordinate descent/ascent method fits into the R-FDM framework.

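Definition 3 (Randomized Feasible Descent Method (R-FDM)). A sequence $\{x_k\}_{k=0}^{\infty}$ is generated
by R-FDM if there exist $\beta \ge 0$, $\zeta > 0$ and $\{\omega_k\}_{k=0}^{\infty}$ with $\min_k \omega_k \ge \bar{\omega} > 0$ such that for every
iteration $k$, the following conditions are satisfied

\[ x_{k+1} = \mathrm{Proj}_X^W\left(x_k - \omega_k W^{-1}(\nabla f(x_k) - z_k)\right), \tag{12} \]

\[ \mathbb{E}[(\|z_k\|_W^*)^2] \le \beta^2\, \mathbb{E}[\|x_k - x_{k+1}\|_W^2], \tag{13} \]

\[ \mathbb{E}[f(x_{k+1})] \le f(x_k) - \zeta\, \mathbb{E}[\|x_k - x_{k+1}\|_W^2], \tag{14} \]

where $z_k$ is some random vector that satisfies the Markov property conditioned on $x_k$.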

We will now compare the new Randomized FDM framework (Definition 3) with the original FDM
((2)-(4)), where, for simplicity of exposition, we take $\|\cdot\|_W = \|\cdot\|_2$ (i.e., $W = I$). Notice that
the first step of R-FDM (12) is the same as the first step of FDM (2). The key difference between
FDM and R-FDM is that for FDM, (3) and (4) must hold deterministically at every iteration (with a
deterministic vector $z_k$), whereas for R-FDM they are replaced by conditions (13) and (14), where $z_k$
is a random vector. Conditions (13) and (14) are weaker than (3) and (4): they only require (3) and (4)
to hold on average. Thus, the R-FDM framework is more general than FDM.

Remark 4. We will see later (in the proof of convergence of R-FDM) that (13) can be relaxed to the
existence of a constant $\eta > 0$ such that $\mathbb{E}[(\|z_k\|_W^*)^2] \le \eta\, \mathbb{E}[\|x_k - x_{k+1}\|_W^2]$.

We will now demonstrate (see Theorem 6) that, under an additional mild assumption, if the set
$X = \mathbb{R}^n$, then SCDM (captured in Algorithm 1 with Option I) is equivalent to R-FDM. We also
remark that there is a need to modify R-FDM so that the minibatch stochastic coordinate descent
method can be analyzed even when $X \ne \mathbb{R}^n$. However, first we describe SCDM and make the
following assumption in order to establish the equivalence of SCDM with $X = \mathbb{R}^n$ and R-FDM.

Assumption 5. The function $f$ is coordinate-wise strongly convex with respect to the norm $\|\cdot\|_W$
with parameter $\gamma > 0$, if for any $x \in X$, any $i \in \{1, 2, \dots, n\}$ and any $\xi$ with $x + (\xi - x^{(i)}) e_i \in X$ we have

\[ f(x^{(1)}, \dots, x^{(i-1)}, \xi, x^{(i+1)}, \dots, x^{(n)}) \ge f(x) + \nabla_i f(x)(\xi - x^{(i)}) + \gamma w_i (\xi - x^{(i)})^2. \tag{15} \]

Note that Assumption 5 does not imply strong convexity of the function $f$. For example, (15) is
satisfied for the Lasso problem or for the SVM dual problem whenever $\forall i : \|a_i\| > 0$, and neither
of those problems is strongly convex.

Theorem 6. Let Assumptions 1, 2 and 5 hold. If $X = \mathbb{R}^n$ then the Stochastic Coordinate Descent
Method (SCDM) (Algorithm 1 with Option I) is equivalent to R-FDM with the parameters $\beta^2 =
2[(L_f^W)^2 + 1] + (n - 1) r^2$, $\zeta = \gamma$, $\omega_k = 1$, where $r = \max_i \frac{L_i}{w_i}$.

Algorithm 1 Stochastic Coordinate Descent Method (SCDM)

1: Input: $f(x)$, diagonal matrix $W \succ 0$, $x_0$, size of minibatch $\tau \in \{1, 2, \dots, n\}$
2: Input: $X = X_1 \times \dots \times X_n$, where $X_i = [a, b]$ with $-\infty \le a < b \le +\infty$
3: while $k \ge 0$ do
4:     choose $i \in \{1, 2, \dots, n\}$ uniformly at random
5:     set $x_{k+1} = x_k$
6:     Option I: $x_{k+1}^{(i)} = \arg\min_{\xi \in X_i} f(x_k^{(1)}, \dots, x_k^{(i-1)}, \xi, x_k^{(i+1)}, \dots, x_k^{(n)})$
7:     Option II: $x_{k+1} = \mathrm{Proj}_X^W\left(x_k - \omega_k W^{-1} \nabla_i f(x_k) e_i\right)$
8: end while
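To make the coordinate update of Algorithm 1 concrete, here is a minimal sketch of SCDM with Option II on a toy box-constrained quadratic. The objective, the box $[0,1]^2$, the weights $w_i = L_i = Q_{ii}$, the step size $\omega_k = 1$ and the function name `scdm_option_two` are assumptions of this example, not choices prescribed by the paper. For a quadratic objective with these weights, the Option II step coincides with exact minimization along the chosen coordinate.

```python
import random

# Toy problem (an assumption of this sketch): f(x) = 0.5 x^T Q x - b^T x on [0, 1]^2.
Q = [[2.0, 1.0], [1.0, 2.0]]
B_VEC = [1.0, 3.0]


def f(x):
    n = len(x)
    quad = sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(bi * xi for bi, xi in zip(B_VEC, x))


def grad_i(x, i):
    # i-th partial derivative of f at x.
    return sum(Q[i][j] * x[j] for j in range(len(x))) - B_VEC[i]


def scdm_option_two(x0, iters=200, seed=0):
    rng = random.Random(seed)
    n = len(x0)
    w = [Q[i][i] for i in range(n)]  # W = diag(L_1, ..., L_n); here L_i = Q_ii
    omega = 1.0
    x = list(x0)
    values = [f(x)]
    for _ in range(iters):
        i = rng.randrange(n)                      # coordinate chosen uniformly at random
        candidate = x[i] - omega * grad_i(x, i) / w[i]
        x[i] = min(max(candidate, 0.0), 1.0)      # projection onto X_i = [0, 1]
        values.append(f(x))
    return x, values
```

Only one partial derivative is evaluated per iteration, which is the source of the per-iteration cost advantage over a full cyclic sweep.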


The following remark compares the result of the above theorem with the cyclic rule.

Remark 7. It has been shown that for the cyclic coordinate descent method (which is not
randomized and hence (12)-(14) hold deterministically) we have $\omega_k^{\mathrm{cyclic}} = 1$, $\zeta^{\mathrm{cyclic}} = \gamma$ and
$(\beta^{\mathrm{cyclic}})^2 = (1 + \sqrt{n} L_f^W)^2 = 1 + 2\sqrt{n} L_f^W + n (L_f^W)^2$. For simplicity, let us assume that
$W = \mathrm{diag}(L_1, L_2, \dots, L_n)$. Then $r = 1$ and $L_f^W \in [1, n]$. For the cyclic coordinate descent
method and SCDM, $\omega_k$ and $\zeta$ are the same. However, if we consider the worst case (when $L_f^W = n$)
we have that $\beta^2 \sim O(n^2)$, whereas $(\beta^{\mathrm{cyclic}})^2 \sim O(n^3)$. Also note that one iteration of cyclic
coordinate descent requires $n$ coordinate updates, whereas SCDM updates just one coordinate, and
therefore each iteration of SCDM is $n$ times cheaper. In the other extreme, when $L_f^W = 1$ we have
that both $\beta^2$ and $(\beta^{\mathrm{cyclic}})^2$ are $O(n)$, but again we recall that one iteration of SCDM is $n$ times
cheaper.
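The orders of magnitude quoted in Remark 7 can be checked numerically. The sketch below evaluates the two constants, assuming the SCDM expression $\beta^2 = 2[(L_f^W)^2 + 1] + (n-1) r^2$ from Theorem 6 and the cyclic expression $(1 + \sqrt{n} L_f^W)^2$ from the remark; the function names are hypothetical.

```python
import math


def beta_sq_scdm(lipschitz, n, r=1.0):
    # beta^2 for SCDM viewed as R-FDM (Theorem 6), with r = max_i L_i / w_i.
    return 2.0 * (lipschitz ** 2 + 1.0) + (n - 1) * r ** 2


def beta_sq_cyclic(lipschitz, n):
    # (beta^cyclic)^2 for cyclic coordinate descent, as quoted in Remark 7.
    return (1.0 + math.sqrt(n) * lipschitz) ** 2
```

For example, with $n = 100$ the worst case $L_f^W = n$ gives values that differ by roughly a factor of $n/2$, while for $L_f^W = 1$ both constants are of order $n$.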

It turns out that if $X \ne \mathbb{R}^n$ then SCDM does not fit the R-FDM framework because $\nabla_i f(x_k)$
cannot be bounded by $\|x_k - x_{k+1}\|_W$. Thus, there is a need to modify R-FDM such that the SCDM
algorithm can be analyzed for bounded problems.

The natural modification to R-FDM, which would allow SCDM to fit the framework, is the
following: at each iteration $k$ we require that in (12) only a subset of coordinates of the vector $x_k$
is updated. This can be achieved by the following method.

Definition 8 (Randomized Coordinate Feasible Descent Method (RC-FDM)). Let $X = X_1 \times \dots \times
X_n$, where $X_i$ are intervals. A sequence $\{x_k\}_{k=0}^{\infty}$ is generated by RC-FDM if there exist $\beta \ge 0$,
$\zeta > 0$ and $\{\omega_k\}_{k=0}^{\infty}$ with $\min_k \omega_k \ge \bar{\omega} > 0$ such that for every iteration $k$, the following are
satisfied

\[ x_{k+1} = \mathrm{Proj}_X^W\left(x_k - \omega_k W^{-1}(\nabla f(x_k) - z_k)_{[i]}\right), \tag{16} \]

\[ (\|(z_k)_{[i]}\|_W^*)^2 \le \beta^2 \|x_k - x_{k+1}\|_W^2, \tag{17} \]

\[ f(x_{k+1}) \le f(x_k) - \zeta \|x_k - x_{k+1}\|_W^2, \tag{18} \]

where $i$ is a coordinate selected uniformly at random from the set $\{1, 2, \dots, n\}$, $x_{[i]}$ is a vector
whose elements $j \ne i$ are set to 0 and $z_k$ is some fixed vector at iteration $k$.

Now, we can show that even if $X \ne \mathbb{R}^n$, SCDM is RC-FDM. The first theorem holds if Option I is
used in Algorithm 1 and the second theorem holds if Option II is used.

Theorem 9. Let Assumptions 1, 2 and 5 hold. If $X = X_1 \times \dots \times X_n$, where $X_i$ are intervals,
then the Stochastic Coordinate Descent Method in Algorithm 1 with Option I is RC-FDM with
$\beta^2 = 2[(L_f^W)^2 + 1]$, $\zeta = \gamma$, and $\omega_k = 1$.

Theorem 10. Let Assumptions 1, 2 and 5 hold. If $X = X_1 \times \dots \times X_n$, where $X_i$ are intervals, then
the Stochastic Coordinate Descent Method in Algorithm 1 with Option II is RC-FDM with $z_k = 0$,
$\zeta = \gamma$, $\beta^2 = 0$, $\omega_k = 1$, and $W = \mathrm{diag}(L_1, L_2, \dots, L_n)$.

3 Convergence Analysis


In 121 they proved linear convergence for FDM under Assumptions 1 and 2. The following theorem
shows that a linear convergence rate can also be established for R-FDM.

Theorem 11 (Linear Convergence of R-FDM). Let Assumptions 1 and 2 hold. If the sequence $\{x_k\}_{k=0}^{\infty}$
is produced by R-FDM (i.e. (12)-(14) are satisfied) then

\[ \mathbb{E}[f(x_k) - f^*] \le \left(\frac{c}{1 + c}\right)^k (f(x_0) - f^*), \tag{19} \]

where

\[ c = \frac{2}{\kappa_f \zeta}\left(\left(L_f^W + \frac{1}{\bar{\omega}}\right)^2 + \beta^2\right). \tag{20} \]

The next theorem establishes a linear convergence rate for RC-FDM.

Theorem 12 (Linear Convergence of RC-FDM). Let $X = X_1 \times \dots \times X_n$, where $X_i$ are intervals.
Further, let Assumptions 1 and 2 hold. Let the sequence $\{x_k\}_{k=0}^{\infty}$ be produced by RC-FDM (i.e.
(16)-(18) are satisfied), then for $z_k \ne 0$ there exists $c \in (0, 1)$ such that for all $k$

\[ \mathbb{E}[f(x_k) - f^*] \le (1 - c)^k (f(x_0) - f^*). \tag{21} \]

Moreover, if for all $k$ we have $z_k = 0$, and $\frac{1}{\omega_k} \ge \max_i \frac{L_i}{w_i}$, then $c = \frac{2 \bar{\omega} \kappa_f}{n(2 \bar{\omega} \kappa_f + 1)}$, with

\[ \mathbb{E}[f(x_k) - f^*] \le (1 - c)^k \left(f(x_0) - f^* + \frac{1}{2\bar{\omega}} \|x_0 - \bar{x}_0\|_W^2\right). \tag{22} \]
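A geometric rate of the form $(1 - c)^k$ translates directly into an iteration count. The sketch below assumes the rate constant for the case $z_k = 0$ has the form $c = 2\omega\kappa_f / (n(2\omega\kappa_f + 1))$ with a constant step size $\omega$ (for $\omega = 1$ this matches the expression $2\kappa_f / (n(2\kappa_f + 1))$ used in Section 5), and computes the smallest $k$ with $(1 - c)^k C_0 \le \epsilon$, where $C_0$ stands for the initial bracket in (22). The function names and the sample values in the usage note are hypothetical.

```python
import math


def rate_constant(kappa_f, n, omega=1.0):
    # Assumed form of c in Theorem 12 for z_k = 0 and a constant step size omega.
    return 2.0 * omega * kappa_f / (n * (2.0 * omega * kappa_f + 1.0))


def iterations_needed(c, initial_gap, eps):
    # Smallest k with (1 - c)^k * initial_gap <= eps, for 0 < c < 1.
    if initial_gap <= eps:
        return 0
    k = math.ceil(math.log(eps / initial_gap) / math.log(1.0 - c))
    # Guard against floating-point rounding at the boundary.
    while (1.0 - c) ** k * initial_gap > eps:
        k += 1
    return k
```

For instance, `iterations_needed(rate_constant(0.5, 10), 1.0, 1e-3)` gives the number of coordinate updates that the bound guarantees for a hypothetical problem with $\kappa_f = 0.5$ and $n = 10$; the count grows linearly in $n$ and in $1/\kappa_f$ when $\kappa_f$ is small.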


3.1 Comparison with the Results in Related Literature 

In Theorem 12 we established the linear convergence of RC-FDM for any $z_k$. We will now compare
our result with the one presented in IS) for the projected coordinate gradient descent algorithm.
Note that the projected coordinate gradient descent algorithm fits the RC-FDM framework exactly.
We also note that the result in 0 only holds for $z_k = 0$, so our result is more general. Further,
even though the paper ||5l considers an asynchronous implementation, where the update computed
at iteration $k$ is based on gradient information at a point up to $\tau$ iterations old, if $\tau = 0$ then their
method fits into the RC-FDM framework. One of the benefits of our work is that more general norms
can be used. So, for simplicity, and to match with the work in Q, let us assume that $L_i = 1$ for
all $i$ and we also choose $w_i = 1$ for all $i$. (This is the case e.g. for the SVM dual problem.) The

geometric rate in (l22l i in our work is then 1- , i and from Theorem 4.1 in ||5| for t = 0 we 

obtain that the geometric rate is 1 — -j, where L^ax > 1 is such that 

\[ \|\nabla f(x) - \nabla f(x + \delta e_i)\|_\infty \le L_{\max} |\delta| \]

holds $\forall x \in \mathbb{R}^n$, $\delta \in \mathbb{R}$ and $i \in \{1, 2, \dots, n\}$. Hence, in this case our convergence results are better.

In 121 the author provided a linear convergence rate for deterministic FDM. It is shown in Theorem 

3.2 in 121 that the coefficient of the linear rate is 1 — where p = -^{Lf ^ whereas, 

in Theorem[T 2 ]of this work, from (fT9l l we see that the coefficient is the same but with a different p. 
To be precise, in our case we have P = -^ result can be better or worse 

than that in l2l, depending on the values of $L_f^W$, $\bar{\omega}$ and $\beta$, but our result holds for R-FDM, which is
broader than FDM.

4 Global Error Bound Property 

In this section we describe a class of problems that satisfies the Global Error Bound (GEB) property.
We show that this implies the weak strong convexity property and we compare the convergence rate
obtained in this paper with several results in the current literature derived for problems obeying the
GEB. We begin by defining the projected gradient.

Definition 13 (Projected Gradient). For any $x \in \mathbb{R}^n$ let us define the projected gradient as follows:

\[ \nabla^+ f(x) := x - \mathrm{Proj}_X(x - \nabla f(x)). \tag{23} \]

Note that the projected gradient is zero at $x$ if and only if $x$ is an optimal solution of (1). Also, we will
employ the projected gradient to define an error bound, which measures the distance between $x$ and
the optimal solution set. Now, we are ready to define a global error bound as follows.
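The projected gradient (23) is simple to evaluate when $X$ is a box. The sketch below uses a toy quadratic on $[0,1]^2$ (an assumption of this example, as are the names `Q`, `B_VEC` and `projected_gradient_map`) and shows that the map vanishes at the constrained minimizer but not at a non-optimal feasible point.

```python
# Toy problem (an assumption of this sketch): f(x) = 0.5 x^T Q x - b^T x on [0, 1]^2,
# whose constrained minimizer is x = (0, 1).
Q = [[2.0, 1.0], [1.0, 2.0]]
B_VEC = [1.0, 3.0]


def grad(x):
    return [sum(Q[i][j] * x[j] for j in range(2)) - B_VEC[i] for i in range(2)]


def projected_gradient_map(x, lo=0.0, hi=1.0):
    # Definition (23): x - Proj_X(x - grad f(x)), with X = [lo, hi]^n.
    g = grad(x)
    projected = [min(max(xi - gi, lo), hi) for xi, gi in zip(x, g)]
    return [xi - pi for xi, pi in zip(x, projected)]
```

At the minimizer the gradient itself is nonzero (it points out of the box), which is why the projected gradient rather than the plain gradient is used as the optimality measure.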

Definition 14 (Definition 6 in ifTSll). An optimization problem admits a global error bound if there
is a constant $\eta_f$ such that

\[ \|x - \bar{x}\| \le \eta_f \|\nabla^+ f(x)\|, \quad \forall x \in X. \tag{24} \]

A relaxed condition called the global error bound from the beginning is if the above inequality holds
only for $x \in X$ such that $f(x) - f^* \le M$, where $M$ is a constant, and usually we have that
$M = f(x_0) - f^*$.

Let us consider a special instance of (1) when $X$ is a polyhedral set, i.e.

\[ X = \{x : Bx \le c\}, \tag{25} \]

and the function $f$ has the following structure

\[ f(x) = h(Ax) + q^T x, \tag{26} \]

where $B$ and $A$ are matrices of compatible dimensions, $h$ is a $\sigma_h$ strongly convex function and $f$ satisfies Assumption 2. We

also assume that there exists an optimal solution and hence the optimal solution set X* is assumed 
to be non-empty ifTSl . It is easy to observe that if / is strongly convex, then @ is trivially satisfied. 
Just recently, ||9l showed that if (l24l l is satisfied, then (|5]) is satisfied with 


Lf 


For problem (l26T l it was discussed in ifTSll that 


(27) 


Pf = e^{l + LY)(^ 


/l + 2||V/i(Ax)||' 


CTh 


4M +20||V/(x)||, 


where 0 is a constant from the Hoffman bound ||2]|4l [TSl defined as follows 

T 


9 := sup ■ 


B'^u + 


= l,u > 0 


and the corresponding rows of B, A to u, v’s 
non-zero elements are linearly independent. 


. 


(28) 


(29) 


Note that the constant 9 can be very big (we will provide a brief discussion on this in Section|5]l. 
In |9l they derived that for problem ( |26] |. the weak strong convexity property Q holds with 


(30) 

Note that Kf given in dSOl l is 0{9^) whereas kj obtained from (l27l l is of the order 0^. Therefore we 
will compare our results using the latter estimates of Kf. 


4.1 Comparison with the Results in Related Literature 

In Theorem 8 in ITSl . under the global error bound prope^, it is proven that FDM converges at a 
linear rate: f{xk+i) -/*<(!- -^)(/(xfc) - /*), witlQ 

11 111 1 -I- T 1 

C = -{Lf + ^+ /3)(1 + rjfi^ + fi)) = -ALj + ^ + m + 9^ -+ fi)) 

UJ UJ UJ (J h (jJ 




Tn Gl] it was shown that l l28t . in some special cases (e.g. when X = R"), is pf — 9 
From Theorem 11 in this work, we have linear convergence of RC-FDM with the coefficient

' = 4c + i)‘ + -5') ® 5 ■ 

These coefficients are very similar, but FDM ifTSl covers only cyclic coordinate descent and not a
randomized coordinate descent method (which is covered by Theorem 11).


5 Linear Convergence Rate of SDCA for Dual of SVM 


In this section we show that the SDCA algorithm (which is SCDM applied to (10)) achieves a linear
convergence rate for the duality gap. This improves upon the result obtained in lfT4l [TSl [161 where
only a sublinear rate was derived.

Let us assume, for simplicity, that in problem (9) for all $i \in \{1, 2, \dots, n\}$ it holds that $\|a_i\| \le 1$.
Then from lITSlfThl we have that for any $x \in \mathbb{R}^n$, $s \in [0,1]$ and the function $f$ defined in (10) we
have

\[ f(x) - f^* \ge s\, G(x) - s^2 \frac{\sigma^2}{2\lambda}, \tag{31} \]

where /* denotes the optimal value of (ITOl i. A = [oi, 02 ,..., On], <7^ = 4ll"^ll ^ In’ C(a;) 
is the duality gap at the point x, which is defined as G{x) := P{^Ax) + f{x). 

Let us remark that SDCA for problem dTOl l is equivalent to RC-FDM, where the constants in (fThl l- 
(IT^ are given as follows: Zk = 0, (3^ = 0, Wi = Li = and Wfe = 1. Hence, if we choose 

xo = 0 then from TheoremfT^we have that 'E[f{xk) — /*] < (1 — c)^ (/(O) — /* + ||2;*|||) with 

^ 2k f 

~ n{2Kf + l)- 

Now, we see that rearranging (31) gives

\[ G(x) \le s \frac{\sigma^2}{2\lambda} + \frac{1}{s}(f(x) - f^*). \tag{32} \]

If we want to achieve $G(x) \le \epsilon$ it is sufficient to choose both terms on the right hand side of (32) to
be $\le \epsilon/2$. Hence, we can set $s = \min\{1, \epsilon\lambda/\sigma^2\}$. All we have to do now is to choose $k$ such that
$f(x_k) - f^* \le s\epsilon/2$. In the following theorem we establish linear convergence of the duality gap $G(x)$
for the SDCA algorithm.

Theorem 15. Let $s = \min\{1, \epsilon\lambda/\sigma^2\}$ and let $K$ be such that

(, . 1 2 (/(o)-r + ll*‘iii) 

+ -s-■ 

Then if the SDCA algorithm is applied to problem (10) to produce $\{x_k\}_{k=0}^{\infty}$, then $\forall k \ge K$ we have
that $\mathbb{E}[G(x_k)] \le \epsilon$.
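The duality gap that Theorem 15 controls can be monitored during a run of SDCA. The following minimal sketch assumes the standard hinge-loss pair: primal $P(w) = \frac{1}{n}\sum_i \max\{0, 1 - y_i w^T a_i\} + \frac{\lambda}{2}\|w\|^2$, dual objective $f(x) = \frac{1}{2\lambda n^2} x^T Q x - \frac{1}{n}\mathbf{1}^T x$ on $[0,1]^n$, and the primal-dual map $w(x) = \frac{1}{\lambda n}\sum_i x_i y_i a_i$. These forms, the function names and the tiny data set in the usage note are assumptions of the example rather than details confirmed by the text.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sdca_hinge(A, y, lam, iters, seed=0):
    # A is a list of n feature vectors a_i; x holds the dual variables in [0, 1].
    n, d = len(A), len(A[0])
    rng = random.Random(seed)
    x = [0.0] * n
    w = [0.0] * d  # maintained as w = (1 / (lam * n)) * sum_i x_i y_i a_i
    for _ in range(iters):
        i = rng.randrange(n)
        sq_norm = dot(A[i], A[i])
        if sq_norm == 0.0:
            continue
        margin = y[i] * dot(w, A[i])
        # Exact minimization of the dual along coordinate i, clipped to [0, 1].
        new_xi = min(1.0, max(0.0, x[i] + lam * n * (1.0 - margin) / sq_norm))
        delta = new_xi - x[i]
        x[i] = new_xi
        for j in range(d):
            w[j] += delta * y[i] * A[i][j] / (lam * n)
    return x, w


def duality_gap(A, y, lam, x, w):
    # G(x) = P(w(x)) + f(x); uses (1 / (2 lam n^2)) x^T Q x = (lam / 2) ||w(x)||^2.
    n = len(A)
    hinge = sum(max(0.0, 1.0 - yi * dot(w, ai)) for ai, yi in zip(A, y)) / n
    reg = 0.5 * lam * dot(w, w)
    return (hinge + reg) + (reg - sum(x) / n)
```

As a usage example, for the four points $(\pm 1, 0)$, $(0, \pm 1)$ with labels $+1$ for $(1,0)$, $(0,1)$ and $-1$ for the other two, and $\lambda = 0.1$, the gap starts at $P(0) = 1$ for $x = 0$ and drops to zero (up to rounding) once every coordinate direction has been visited.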


Let us now comment on the size of the parameter $\kappa_f$. In our case, $X$ is the polyhedral set
(25) defined by $B = (-I_n \;\; I_n)^T$ and $c = (0^T, \mathbf{1}^T)^T$, where $I_n \in \mathbb{R}^{n \times n}$ is the identity matrix.

Because of this structure (29) simplifies to


6 := sup 

u^n 




and the corresponding rows of A to u, u’s 
non-zero elements are linearly independent. 


(33) 


To show that 6 can be very large, let us assume that two rows of the matrix A are highly correlated 
(in this case rows corresponds to features). We denote these two rows by Ai and A 2 , and let us 
assume that Ai = A 2 + dei. Then we can chose 0,..., 0)^ and u = 0. This particular 

choice is feasible in optimization problem (l33l l and hence is imposing a lower-bound on 0: 9 > 

6 Summary


In this paper we have extended the framework of the feasible descent method (FDM) into a randomized
and a randomized coordinate FDM framework. We have provided a linear convergence rate
(under the weak strong convexity assumption) for both methods and we have shown that the
convergence rates are similar to the deterministic/non-randomized FDM. We showed that for the cyclic
coordinate descent method the coefficients in FDM are worse or similar to the stochastic coordinate
descent method (and hence the theory tells us that they converge at roughly the same speed), but
each iteration of the stochastic coordinate descent method is $n$ times cheaper. We concluded the
paper with a result showing that, for the SDCA algorithm applied to the dual of the linear SVM, the
duality gap converges linearly.

A Proofs


A.1 Proof of Theorem 6

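Let us define an auxiliary vector $\tilde{x}$ such that

\[ \tilde{x}^{(i)} = \arg\min_{\xi \in X_i} f(x_k^{(1)}, \dots, x_k^{(i-1)}, \xi, x_k^{(i+1)}, \dots, x_k^{(n)}). \tag{34} \]

Then we can see that if coordinate $i$ is chosen during iteration $k$ in Algorithm 1 then

\[ x_{k+1}^{(j)} = \begin{cases} x_k^{(j)}, & \text{if } j \ne i, \\ \tilde{x}^{(i)}, & \text{otherwise.} \end{cases} \tag{35} \]

If coordinate $i$ is chosen during iteration $k$, then the optimality conditions for Step 6 of Algorithm
1 give us that

\[ x_{k+1}^{(i)} = \mathrm{Proj}_{X_i}\left(x_{k+1}^{(i)} - \frac{1}{w_i} \nabla_i f(x_{k+1})\right). \tag{36} \]

Moreover, by (35), for $j \ne i$ we have that $x_k^{(j)} = x_{k+1}^{(j)}$, which is possible only if $z_k^{(j)} = \nabla_j f(x_k)$.

Note that $x_{k+1}$ is a random variable, which depends on $i$ and $x_k$ only. Therefore, we can define a
random $z_k$ such that the $i$-th coordinate is

\[ z_k^{(i)} = \nabla_i f(x_k) - \nabla_i f(x_k^{(1)}, \dots, x_k^{(i-1)}, \tilde{x}^{(i)}, x_k^{(i+1)}, \dots, x_k^{(n)}) + w_i (\tilde{x}^{(i)} - x_k^{(i)}) \tag{37} \]

and the $j$-th coordinate (for $j \ne i$) is defined as $z_k^{(j)} = \nabla_j f(x_k)$. It is easy to verify that for $z_k$
defined above, condition (12) holds. Now, we will compute $\mathbb{E}[(\|z_k\|_W^*)^2]$. We have that if the $i$-th
coordinate is chosen then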


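\[ \frac{1}{w_i} (z_k^{(i)})^2 = \frac{1}{w_i} \left( \nabla_i f(x_k) - \nabla_i f(x_k^{(1)}, \dots, \tilde{x}^{(i)}, \dots, x_k^{(n)}) + w_i (\tilde{x}^{(i)} - x_k^{(i)}) \right)^2 \]
\[ \le \frac{2}{w_i} \left( \nabla_i f(x_k) - \nabla_i f(x_k^{(1)}, \dots, \tilde{x}^{(i)}, \dots, x_k^{(n)}) \right)^2 + 2 w_i (x_k^{(i)} - \tilde{x}^{(i)})^2 \]
\[ \overset{(7)}{\le} 2 \left( L_f^W \| x_k - (x_k^{(1)}, \dots, \tilde{x}^{(i)}, \dots, x_k^{(n)})^T \|_W \right)^2 + 2 w_i (x_k^{(i)} - \tilde{x}^{(i)})^2 \]
\[ = 2 (L_f^W)^2 w_i (x_k^{(i)} - \tilde{x}^{(i)})^2 + 2 w_i (x_k^{(i)} - \tilde{x}^{(i)})^2 = 2[(L_f^W)^2 + 1]\, w_i (x_k^{(i)} - \tilde{x}^{(i)})^2, \tag{38} \]

and for $j \ne i$ we have $\frac{1}{w_j} (z_k^{(j)})^2 = \frac{1}{w_j} (\nabla_j f(x_k))^2$. Hence, we obtain that

\[ \mathbb{E}[(\|z_k\|_W^*)^2] \overset{(38)}{\le} \frac{1}{n} \sum_{i=1}^n 2[(L_f^W)^2 + 1]\, w_i (x_k^{(i)} - \tilde{x}^{(i)})^2 + \frac{n-1}{n} \sum_{i=1}^n \frac{1}{w_i} (\nabla_i f(x_k))^2. \tag{39} \]

From the optimality condition of Step 6 of Algorithm 1 and the fact that $X_i = \mathbb{R}$, we know that for
all $i$ the following holds:

\[ \nabla_i f(x_k^{(1)}, \dots, x_k^{(i-1)}, \tilde{x}^{(i)}, x_k^{(i+1)}, \dots, x_k^{(n)}) = 0. \tag{40} \]

Therefore $\forall i$ we have

\[ \frac{1}{w_i} (\nabla_i f(x_k))^2 = \frac{1}{w_i} \left( \nabla_i f(x_k) - \nabla_i f(x_k^{(1)}, \dots, \tilde{x}^{(i)}, \dots, x_k^{(n)}) \right)^2 \overset{(6)}{\le} \left(\frac{L_i}{w_i}\right)^2 w_i (x_k^{(i)} - \tilde{x}^{(i)})^2. \]

If we denote $r = \max_i \frac{L_i}{w_i}$ then we obtain from (39) that

\[ \mathbb{E}[(\|z_k\|_W^*)^2] \le \frac{1}{n} \sum_{i=1}^n 2[(L_f^W)^2 + 1]\, w_i (x_k^{(i)} - \tilde{x}^{(i)})^2 + \frac{n-1}{n} r^2 \sum_{i=1}^n w_i (x_k^{(i)} - \tilde{x}^{(i)})^2 \]
\[ = \left( 2[(L_f^W)^2 + 1] + (n-1) r^2 \right) \frac{1}{n} \sum_{i=1}^n w_i (x_k^{(i)} - \tilde{x}^{(i)})^2 \]
\[ = \left( 2[(L_f^W)^2 + 1] + (n-1) r^2 \right) \mathbb{E}[\|x_k - x_{k+1}\|_W^2] \]

and we can conclude that (13) holds with $\beta^2 = 2[(L_f^W)^2 + 1] + (n-1) r^2$.

Now, it remains to show (14). From the optimality of $x_{k+1}^{(i)}$ in Step 6 of Algorithm 1 we know that

\[ \nabla_i f(x_{k+1}) (x_{k+1}^{(i)} - x_k^{(i)}) \le 0. \tag{41} \]

Therefore from (15) with $\xi = x_k^{(i)}$ and $x = (x_k^{(1)}, \dots, x_k^{(i-1)}, \tilde{x}^{(i)}, x_k^{(i+1)}, \dots, x_k^{(n)})^T = x_{k+1}$ we
have that

\[ f(x_k) - f(x_{k+1}) \ge \gamma w_i |x_k^{(i)} - x_{k+1}^{(i)}|^2 + \nabla_i f(x_{k+1}) (x_k^{(i)} - x_{k+1}^{(i)}) \overset{(41)}{\ge} \gamma w_i |x_k^{(i)} - x_{k+1}^{(i)}|^2. \tag{42} \]

Therefore

\[ f(x_k) - f(x_{k+1}) \overset{(42)}{\ge} \gamma w_i |x_k^{(i)} - x_{k+1}^{(i)}|^2 = \gamma \|x_k - x_{k+1}\|_W^2, \]

and by taking expectation on both sides of the above, (14) follows with $\zeta = \gamma$.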

A.2 Proof of Theorem 9

This proof is very similar to the proof of Theorem 6. Let us define an auxiliary vector $\tilde{x}$ in the same
way as in (34). Then we can see that if coordinate $i$ is chosen during iteration $k$ in Algorithm 1 then
(35) holds, and the optimality conditions for Step 6 of Algorithm 1 imply that (36) holds.

Note that $x_{k+1}$ is a random variable which depends on $i$ and $x_k$ only. Therefore, we can define $z_k$
such that the $i$-th coordinate is given by (37). It is easy to verify that for $z_k$ defined in (37), the condition
(16) holds. Now, let us compute $(\|(z_k)_{[i]}\|_W^*)^2$. We have that

\[ (\|(z_k)_{[i]}\|_W^*)^2 = \frac{1}{w_i} (z_k^{(i)})^2 \overset{(38)}{\le} 2[(L_f^W)^2 + 1]\, \|x_k - x_{k+1}\|_W^2. \]

Therefore, we conclude that (17) holds with $\beta^2 = 2[(L_f^W)^2 + 1]$.

Now, it remains to show (18). Again from the optimality conditions for Step 6 we know that (41) holds. Therefore from (15) with
$\xi = x_k^{(i)}$ and $x = (x_k^{(1)}, \dots, x_k^{(i-1)}, \tilde{x}^{(i)}, x_k^{(i+1)}, \dots, x_k^{(n)})^T = x_{k+1}$ we have (42). Therefore

\[ f(x_k) - f(x_{k+1}) \overset{(42)}{\ge} \gamma w_i |x_k^{(i)} - x_{k+1}^{(i)}|^2 = \gamma \|x_k - x_{k+1}\|_W^2, \]

so (18) holds with $\zeta = \gamma$.

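A.3 Proof of Theorem 11

This proof is based on the proof of Theorem 3.2 in H. We can write the optimality conditions for
$x_{k+1}$ from (12) using the definition of a projection given in (8). We have that $\forall x \in X$, the
following inequality holds

\[ \left\langle W\left(x_{k+1} - x_k + \omega_k W^{-1}(\nabla f(x_k) - z_k)\right), x - x_{k+1} \right\rangle \ge 0. \tag{43} \]

Now, using the convexity of $f$ we obtain that

\[ f(x_{k+1}) - f^* = f(x_{k+1}) - f(\bar{x}_{k+1}) \le \langle \nabla f(x_{k+1}), x_{k+1} - \bar{x}_{k+1} \rangle \]
\[ = \langle \nabla f(x_{k+1}) - \nabla f(x_k) + \nabla f(x_k), x_{k+1} - \bar{x}_{k+1} \rangle. \tag{44} \]

Plugging $x = \bar{x}_{k+1}$ into (43) we obtain

\[ \left\langle \frac{1}{\omega_k} W (x_{k+1} - x_k) - z_k, \bar{x}_{k+1} - x_{k+1} \right\rangle \ge \langle \nabla f(x_k), x_{k+1} - \bar{x}_{k+1} \rangle. \tag{45} \]

Plugging this into (44) gives us that

\[ f(x_{k+1}) - f(\bar{x}_{k+1}) \overset{(44),(45)}{\le} \left\langle \nabla f(x_{k+1}) - \nabla f(x_k) - \frac{1}{\omega_k} W (x_{k+1} - x_k) + z_k, x_{k+1} - \bar{x}_{k+1} \right\rangle \]
\[ \overset{\mathrm{CS}}{\le} \|\nabla f(x_{k+1}) - \nabla f(x_k)\|_W^* \|x_{k+1} - \bar{x}_{k+1}\|_W + \left\langle -\frac{1}{\omega_k} W (x_{k+1} - x_k) + z_k, x_{k+1} - \bar{x}_{k+1} \right\rangle \]
\[ \overset{(7)}{\le} L_f^W \|x_{k+1} - x_k\|_W \|x_{k+1} - \bar{x}_{k+1}\|_W - \frac{1}{\omega_k} \langle W (x_{k+1} - x_k), x_{k+1} - \bar{x}_{k+1} \rangle + \langle z_k, x_{k+1} - \bar{x}_{k+1} \rangle \]
\[ \overset{\mathrm{CS}}{\le} L_f^W \|x_{k+1} - x_k\|_W \|x_{k+1} - \bar{x}_{k+1}\|_W + \frac{1}{\omega_k} \|W (x_{k+1} - x_k)\|_W^* \|x_{k+1} - \bar{x}_{k+1}\|_W + \|z_k\|_W^* \|x_{k+1} - \bar{x}_{k+1}\|_W \]
\[ = \left( \left(L_f^W + \frac{1}{\omega_k}\right) \|x_{k+1} - x_k\|_W + \|z_k\|_W^* \right) \|x_{k+1} - \bar{x}_{k+1}\|_W \]
\[ \overset{(5)}{\le} \left( \left(L_f^W + \frac{1}{\bar{\omega}}\right) \|x_{k+1} - x_k\|_W + \|z_k\|_W^* \right) \sqrt{\frac{1}{\kappa_f} \left(f(x_{k+1}) - f(\bar{x}_{k+1})\right)}. \tag{46} \]

Therefore, we can conclude that

\[ f(x_{k+1}) - f^* \overset{(46)}{\le} \frac{1}{\kappa_f} \left( \left(L_f^W + \frac{1}{\bar{\omega}}\right) \|x_{k+1} - x_k\|_W + \|z_k\|_W^* \right)^2. \tag{47} \]

Taking the expectation of (47) with respect to the random vector $z_k$, we obtain

\[ \mathbb{E}[f(x_{k+1}) - f(\bar{x}_{k+1})] \overset{(47)}{\le} \frac{1}{\kappa_f} \mathbb{E}\left[ \left( \left(L_f^W + \frac{1}{\bar{\omega}}\right) \|x_{k+1} - x_k\|_W + \|z_k\|_W^* \right)^2 \right] \]
\[ \le \frac{2}{\kappa_f} \left( \left(L_f^W + \frac{1}{\bar{\omega}}\right)^2 \mathbb{E}[\|x_{k+1} - x_k\|_W^2] + \mathbb{E}[(\|z_k\|_W^*)^2] \right) \]
\[ \overset{(13)}{\le} \frac{2}{\kappa_f} \left( \left(L_f^W + \frac{1}{\bar{\omega}}\right)^2 + \beta^2 \right) \mathbb{E}[\|x_k - x_{k+1}\|_W^2] \]
\[ \overset{(14)}{\le} \frac{2}{\kappa_f \zeta} \left( \left(L_f^W + \frac{1}{\bar{\omega}}\right)^2 + \beta^2 \right) \left( f(x_k) - \mathbb{E}[f(x_{k+1})] \right) \]
\[ = \frac{2}{\kappa_f \zeta} \left( \left(L_f^W + \frac{1}{\bar{\omega}}\right)^2 + \beta^2 \right) \left( f(x_k) - f(\bar{x}_k) + \mathbb{E}[f(\bar{x}_{k+1})] - \mathbb{E}[f(x_{k+1})] \right). \tag{48} \]

Finally, from (48) we obtain that

\[ \mathbb{E}[f(x_{k+1}) - f^*] = \mathbb{E}[f(x_{k+1}) - f(\bar{x}_{k+1})] \le \frac{c}{1 + c} \left( f(x_k) - f(\bar{x}_k) \right) = \frac{c}{1 + c} \left( f(x_k) - f^* \right), \]

and the result follows.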

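A.4 Proof of Theorem 12 if $z_k = 0$

Let us define an auxiliary vector $\tilde{x}$ such that

\[ \tilde{x}^{(i)} = \left[ \mathrm{Proj}_X^W\left(x_k - \omega_k W^{-1}(\nabla f(x_k) - z_k)_{[i]}\right) \right]^{(i)}. \tag{49} \]

Then we can see that if coordinate $i$ is chosen during iteration $k$ in Algorithm 1 then

\[ x_{k+1}^{(j)} = \begin{cases} x_k^{(j)}, & \text{if } j \ne i, \\ \tilde{x}^{(i)}, & \text{otherwise.} \end{cases} \tag{50} \]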
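Therefore, let us estimate the expected value of $f$ at a random point $x_{k+1}$, where the expectation is
taken with respect to the selection of coordinate $i$ at iteration $k$. Let $h \in \mathbb{R}^n$. Then if $\frac{1}{\omega_k} \ge \max_i \frac{L_i}{w_i}$
we have

\[ \mathbb{E}[f(x_k + h_{[i]})] \overset{(6)}{\le} f(x_k) + \mathbb{E}\left[ \langle \nabla f(x_k), h_{[i]} \rangle + \frac{1}{2\omega_k} \|h_{[i]}\|_W^2 \right] \]
\[ = f(x_k) + \frac{1}{n} \left( \langle \nabla f(x_k), h \rangle + \frac{1}{2\omega_k} \|h\|_W^2 \right) \]
\[ = \left(1 - \frac{1}{n}\right) f(x_k) + \frac{1}{n} \Big( \underbrace{f(x_k) + \langle \nabla f(x_k) - z_k, h \rangle + \frac{1}{2\omega_k} \|h\|_W^2}_{\mathcal{H}(h; x_k, z_k)} + \langle z_k, h \rangle \Big). \tag{51} \]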

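Now, observe that

\[ \tilde{x} = x_k + \arg\min_{h : h + x_k \in X} \mathcal{H}(h; x_k, z_k) = x_k + \arg\min_{h \in \mathbb{R}^n} \left\{ \mathcal{H}(h; x_k, z_k) + \Phi_X(h + x_k) \right\} =: x_k + \tilde{h}, \tag{52} \]

where $\Phi_X(x)$ is the indicator function for the set $X$, i.e.

\[ \Phi_X(x) = \begin{cases} 0, & \text{if } x \in X, \\ +\infty, & \text{otherwise.} \end{cases} \tag{53} \]

From the first order optimality conditions of (52) we have

\[ \nabla f(x_k) - z_k + \frac{1}{\omega_k} W \tilde{h} + s = 0, \tag{54} \]

where $s \in \partial \Phi_X(x_k + \tilde{h})$. We can define a composite gradient mapping ll^ fTTlfTTl

\[ g := -\frac{1}{\omega_k} W \tilde{h}. \tag{55} \]

Therefore, we can observe that

\[ -\nabla f(x_k) + z_k + g \overset{(54)}{\in} \partial \Phi_X(x_k + \tilde{h}). \tag{56} \]

It is also easy to show that

\[ \|\tilde{h}\|_W^2 = \|\omega_k W^{-1} g\|_W^2 = \omega_k^2 (\|g\|_W^*)^2 \tag{57} \]

and

\[ \langle g, \tilde{h} \rangle = -\frac{1}{\omega_k} \|\tilde{h}\|_W^2 \overset{(57)}{=} -\omega_k (\|g\|_W^*)^2. \tag{58} \]

Finally note that for any $y \in X$ we have

\[ \|x_k + \tilde{h} - y\|_W^2 = \|x_k - y\|_W^2 + 2\omega_k \langle g, y - x_k \rangle + \|\tilde{h}\|_W^2 \]
\[ \overset{(57)}{=} \|x_k - y\|_W^2 + 2\omega_k \langle g, y - x_k \rangle + \omega_k^2 (\|g\|_W^*)^2. \tag{59} \]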
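Now, we are ready to bound $\mathcal{H}(h; x_k, z_k) + \Phi_X(x_k + h)$ for $h = \tilde{h}$. We have

$$\begin{aligned}
&\mathcal{H}(\tilde{h}; x_k, z_k) + \Phi_X(x_k + \tilde{h}) \\
&\quad= f(x_k) + \langle \nabla f(x_k) - z_k, \tilde{h} \rangle + \frac{1}{2\omega_k}\|\tilde{h}\|_W^2 + \Phi_X(x_k + \tilde{h}) \\
&\quad\le f(y) + \langle \nabla f(x_k), x_k - y \rangle + \langle \nabla f(x_k) - z_k, \tilde{h} \rangle + \frac{1}{2\omega_k}\|\tilde{h}\|_W^2 \\
&\qquad + \Phi_X(y) + \langle -\nabla f(x_k) + z_k + g,\; x_k + \tilde{h} - y \rangle \\
&\quad= f(y) + \Phi_X(y) + \frac{1}{2\omega_k}\|\tilde{h}\|_W^2 + \langle g, x_k - y \rangle + \langle z_k, x_k - y \rangle + \langle g, \tilde{h} \rangle \\
&\quad= f(y) + \Phi_X(y) + \frac{1}{2}\omega_k (\|g\|_W^*)^2 + \langle g, x_k - y \rangle + \langle z_k, x_k - y \rangle - \omega_k (\|g\|_W^*)^2 \\
&\quad= f(y) + \Phi_X(y) - \frac{1}{2}\omega_k (\|g\|_W^*)^2 + \langle g, x_k - y \rangle + \langle z_k, x_k - y \rangle \\
&\quad= f(y) + \Phi_X(y) - \frac{1}{2\omega_k}\big(\|x_k + \tilde{h} - y\|_W^2 - \|x_k - y\|_W^2\big) + \langle z_k, x_k - y \rangle \\
&\quad= f(y) + \Phi_X(y) - \frac{1}{2\omega_k}\big(n\,\mathbf{E}[\|x_{k+1} - y\|_W^2] - n\|x_k - y\|_W^2\big) + \langle z_k, x_k - y \rangle.
\end{aligned}$$

The inequality uses the convexity of $f$ together with the subgradient inclusion (56); the subsequent equalities use (57) and (58), then (59), and finally (50).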
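The last equality in the chain replaces the full step by an expectation over a single coordinate. As a sanity check, the following sketch computes that expectation exactly for a uniformly chosen coordinate and compares both sides, assuming a diagonal $W$; the vectors are random placeholders for $x_k$, $y$ and $\tilde{h}$.

```python
import random

random.seed(2)
n = 5
w = [random.uniform(0.5, 2.0) for _ in range(n)]   # diagonal of W (assumed diagonal)
x = [random.uniform(-1, 1) for _ in range(n)]      # stands in for x_k
y = [random.uniform(-1, 1) for _ in range(n)]
h = [random.uniform(-1, 1) for _ in range(n)]      # stands in for h-tilde

def norm_w_sq(v):
    return sum(wj * vj * vj for wj, vj in zip(w, v))

def coordinate_step(i):
    """x_{k+1} when coordinate i is selected: only entry i moves, as in (50)."""
    return [xj + (h[j] if j == i else 0.0) for j, xj in enumerate(x)]

# Exact expectation of ||x_{k+1} - y||_W^2 over a uniformly chosen coordinate i.
expected = sum(
    norm_w_sq([a - b for a, b in zip(coordinate_step(i), y)]) for i in range(n)
) / n

base = norm_w_sq([a - b for a, b in zip(x, y)])
full = norm_w_sq([a + c - b for a, c, b in zip(x, h, y)])

lhs = full - base                  # ||x_k + h - y||_W^2 - ||x_k - y||_W^2
rhs = n * expected - n * base      # n E||x_{k+1} - y||_W^2 - n ||x_k - y||_W^2
```

The two quantities `lhs` and `rhs` coincide up to rounding because each coordinate contributes its own weighted square independently of the others.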

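Now, from (51) we conclude that for all $y$ we have

$$\mathbf{E}[f(x_{k+1})] \le \frac{n-1}{n} f(x_k) + \frac{1}{n}\Big(f(y) + \Phi_X(y) - \frac{n}{2\omega_k}\big(\mathbf{E}[\|x_{k+1} - y\|_W^2] - \|x_k - y\|_W^2\big) + \langle z_k, x_k + \tilde{h} - y \rangle\Big),$$

which can be equivalently written as

$$\mathbf{E}\Big[f(x_{k+1}) + \frac{1}{2\omega_k}\|x_{k+1} - y\|_W^2\Big] \le f(x_k) + \frac{1}{2\omega_k}\|x_k - y\|_W^2 - \frac{1}{n}\big(f(x_k) - f(y) - \Phi_X(y)\big) + \frac{1}{n}\langle z_k, x_k + \tilde{h} - y \rangle.$$

If we choose $y = \bar{x}_k$ then the latter inequality reads as follows:

$$\mathbf{E}\Big[f(x_{k+1}) + \frac{1}{2\omega_k}\|x_{k+1} - \bar{x}_k\|_W^2\Big] \le f(x_k) + \frac{1}{2\omega_k}\|x_k - \bar{x}_k\|_W^2 - \frac{1}{n}\big(f(x_k) - f^*\big) + \frac{1}{n}\langle z_k, x_k + \tilde{h} - \bar{x}_k \rangle.$$

From the definition of $\bar{x}$ we obtain that $\|x_{k+1} - \bar{x}_{k+1}\|_W \le \|x_{k+1} - \bar{x}_k\|_W$, therefore

$$\mathbf{E}\Big[f(x_{k+1}) - f^* + \frac{1}{2\omega_k}\|x_{k+1} - \bar{x}_{k+1}\|_W^2\Big] \le \Big(1 - \frac{1}{n}\Big)\big(f(x_k) - f^*\big) + \frac{1}{2\omega_k}\|x_k - \bar{x}_k\|_W^2 + \frac{1}{n}\langle z_k, x_k + \tilde{h} - \bar{x}_k \rangle.$$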


Let us assume that \/k : Zk = 0 . Then let us define c = ^ Then 


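$$\mathbf{E}\Big[f(x_{k+1}) - f^* + \frac{1}{2\omega_k}\|x_{k+1} - \bar{x}_{k+1}\|_W^2\Big] \le (1 - c)\Big(f(x_k) - f^* + \frac{1}{2\omega_k}\|x_k - \bar{x}_k\|_W^2\Big). \tag{60}$$

Therefore

$$\mathbf{E}[f(x_k) - f^*] \le \mathbf{E}\Big[f(x_k) - f^* + \frac{1}{2\omega_k}\|x_k - \bar{x}_k\|_W^2\Big] \overset{(60)}{\le} (1 - c)^k \Big(f(x_0) - f^* + \frac{1}{2\omega_k}\|x_0 - \bar{x}_0\|_W^2\Big).$$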
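To illustrate the kind of behaviour that (60) describes when $z_k = 0$, here is a minimal simulation sketch of a randomized coordinate step of the form (49)-(50). Everything specific in it is an assumption made for the example and not the paper's experiment: the objective is a small least-squares function, $X$ is a box, $W = \mathrm{diag}(L_1, \dots, L_n)$ with $L_i$ the squared column norms of the hypothetical matrix `A`, and $\omega_k = 1$.

```python
import random

random.seed(3)
m, n = 8, 4
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]   # hypothetical data
b = [random.uniform(-1, 1) for _ in range(m)]
lower, upper = -0.5, 0.5                                            # box X (assumed)

def residual(x):
    return [sum(A[r][j] * x[j] for j in range(n)) - b[r] for r in range(m)]

def f(x):
    """f(x) = 0.5 * ||A x - b||^2."""
    return 0.5 * sum(r * r for r in residual(x))

def partial_grad(x, i):
    """i-th partial derivative of f at x."""
    return sum(A[r][i] * res for r, res in enumerate(residual(x)))

# Coordinate-wise Lipschitz constants of the gradient; W = diag(L), omega = 1.
L = [sum(A[r][i] ** 2 for r in range(m)) for i in range(n)]
w, omega = L, 1.0

x = [upper] * n
values = [f(x)]
for k in range(400):
    i = random.randrange(n)                        # uniformly random coordinate
    step = x[i] - omega * partial_grad(x, i) / w[i]
    x[i] = max(lower, min(upper, step))            # projection onto [lower, upper]
    values.append(f(x))
```

With these choices each update minimizes $f$ exactly along the selected coordinate over the box, so the recorded `values` never increase. The sketch does not estimate the constant $c$ in (60).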


A.5 Proof of Theorem 12 if $z_k \neq 0$

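The proof follows similar arguments to the proof of Theorem 12 when $z_k = 0$. Let us define an auxiliary vector $\tilde{x}$ in the same way as in (49). Then we can see that if coordinate $i$ is chosen during iteration $k$ in Algorithm [T] then (50) holds. Therefore, let us estimate the expected value of $f$ at the random point $x_{k+1}$, where the expectation is taken with respect to the selection of coordinate $i$ at iteration $k$. Let $h \in \mathbb{R}^n$. Then if $\frac{1}{\omega_k} \ge \max_i \frac{L_i}{w_i}$ we have that (51) holds. Now, observe that

$$\tilde{x} = x_k + \arg\min_{h:\, h + x_k \in X} \mathcal{H}(h; x_k, z_k) = x_k + \arg\min_{h \in \mathbb{R}^n} \{\mathcal{H}(h; x_k, z_k) + \Phi_X(h + x_k)\} =: x_k + \tilde{h}, \tag{61}$$

where $\Phi_X$ is the indicator function for the set $X$ defined in (53). Now, we have

$$\begin{aligned}
\mathcal{H}(\tilde{h}; x_k, z_k) &= \min_{h \in \mathbb{R}^n} \Big\{ f(x_k) + \langle \nabla f(x_k) - z_k, h \rangle + \frac{1}{2\omega_k}\|h\|_W^2 + \Phi_X(h + x_k) \Big\} \\
&= \min_{y \in \mathbb{R}^n} \Big\{ f(x_k) + \langle \nabla f(x_k) - z_k, y - x_k \rangle + \frac{1}{2\omega_k}\|y - x_k\|_W^2 + \Phi_X(y) \Big\} \\
&\le \min_{\lambda \in [0,1]} \Big\{ f(\lambda \bar{x}_k + (1 - \lambda) x_k) + \langle -z_k, \lambda(\bar{x}_k - x_k) \rangle + \frac{1}{2\omega_k}\|\lambda(\bar{x}_k - x_k)\|_W^2 + \Phi_X(\lambda \bar{x}_k + (1 - \lambda) x_k) \Big\} \\
&\le \min_{\lambda \in [0,1]} \Big\{ \lambda f(\bar{x}_k) + (1 - \lambda) f(x_k) + \lambda \|z_k\|_W^* \|\bar{x}_k - x_k\|_W + \frac{\lambda^2}{2\omega_k}\|\bar{x}_k - x_k\|_W^2 \Big\}.
\end{aligned}$$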

Note that from (50) and (52) we have

$$\|\tilde{h}\|_W^2 = \sum_{i=1}^{n} \|\tilde{h}_{[i]}\|_W^2 = n\,\mathbf{E}[\|x_{k+1} - x_k\|_W^2] \le \frac{n}{\zeta}\,\mathbf{E}[f(x_k) - f(x_{k+1})]. \tag{62}$$


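Therefore, we conclude that

$$\begin{aligned}
\mathbf{E}[f(x_{k+1}) - f^*] &\le \min_{\lambda \in [0,1]} \Big\{ f(x_k) - f^* + \frac{1}{n}\Big(\lambda\big(f(\bar{x}_k) - f(x_k)\big) + \lambda \|z_k\|_W^* \|\bar{x}_k - x_k\|_W \\
&\qquad\qquad + \frac{\lambda^2}{2\omega_k}\|\bar{x}_k - x_k\|_W^2 + \|z_k\|_W^* \|\tilde{h}\|_W\Big) \Big\} \\
&\le \min_{\lambda \in [0,1]} \Big\{ f(x_k) - f^* + \frac{1}{n}\Big(-\lambda\big(f(x_k) - f^*\big) + \lambda \|z_k\|_W^* \|\bar{x}_k - x_k\|_W \\
&\qquad\qquad + \frac{\lambda^2}{2\omega_k \kappa_f}\big(f(x_k) - f^*\big) + \|z_k\|_W^* \|\tilde{h}\|_W\Big) \Big\}.
\end{aligned}$$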

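Now, let us denote $\xi_k = f(x_k) - f^*$ and $\xi_{k+1} = \mathbf{E}[f(x_{k+1}) - f^*]$ (where the expectation is with respect to the random choice of $i$ during the $k$-th iteration). Notice that

$$\|\tilde{h}\|_W^2 = \sum_{i=1}^{n} \big(\|\tilde{h}_{[i]}\|_W\big)^2 \le \frac{n}{\zeta}(\xi_k - \xi_{k+1}). \tag{63}$$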


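Therefore we have

$$\begin{aligned}
\xi_{k+1} &\le \min_{\lambda \in [0,1]} \Big\{ \xi_k + \frac{1}{n}\Big(-\lambda \xi_k + \lambda \|z_k\|_W^* \|\bar{x}_k - x_k\|_W + \frac{\lambda^2}{2\omega_k \kappa_f}\xi_k + \|z_k\|_W^* \|\tilde{h}\|_W\Big) \Big\} \\
&\le \min_{\lambda \in [0,1]} \Big\{ \xi_k + \frac{1}{n}\Big(-\lambda \xi_k + \lambda \|z_k\|_W^* \|\bar{x}_k - x_k\|_W + \frac{\lambda^2}{2\omega_k \kappa_f}\xi_k + \frac{n\beta}{\zeta}(\xi_k - \xi_{k+1})\Big) \Big\},
\end{aligned}$$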


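which is equivalent to

$$\begin{aligned}
\Big(1 + \frac{\beta}{\zeta}\Big)\xi_{k+1} &\le \Big(1 + \frac{\beta}{\zeta}\Big)\xi_k + \min_{\lambda \in [0,1]} \frac{1}{n}\Big\{ -\lambda \xi_k + \lambda \|z_k\|_W^* \|\bar{x}_k - x_k\|_W + \frac{\lambda^2}{2\omega_k \kappa_f}\xi_k \Big\} \\
&\le \Big(1 + \frac{\beta}{\zeta}\Big)\xi_k + \min_{\lambda \in [0,1]} \frac{1}{n}\Big\{ -\lambda \xi_k + \lambda \beta \sqrt{\frac{n}{\zeta}(\xi_k - \xi_{k+1})}\,\sqrt{\frac{\xi_k}{\kappa_f}} + \frac{\lambda^2}{2\omega_k \kappa_f}\xi_k \Big\}.
\end{aligned}$$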

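Using the fact that for all $a, b \in \mathbb{R}_+$ we have $\sqrt{ab} \le \frac{a + b}{2}$, we obtain that

$$\begin{aligned}
\Big(1 + \frac{\beta}{\zeta}\Big)\xi_{k+1} &\le \Big(1 + \frac{\beta}{\zeta}\Big)\xi_k + \min_{\lambda \in [0,1]} \frac{1}{n}\Big\{ -\lambda \xi_k + \sqrt{\frac{n\beta^2}{\zeta}(\xi_k - \xi_{k+1})}\,\sqrt{\frac{\lambda^2}{\kappa_f}\xi_k} + \frac{\lambda^2}{2\omega_k \kappa_f}\xi_k \Big\} \\
&\le \Big(1 + \frac{\beta}{\zeta}\Big)\xi_k + \min_{\lambda \in [0,1]} \frac{1}{n}\Big\{ -\lambda \xi_k + \frac{n\beta^2}{2\zeta}(\xi_k - \xi_{k+1}) + \frac{\lambda^2}{2\kappa_f}\xi_k + \frac{\lambda^2}{2\omega_k \kappa_f}\xi_k \Big\}.
\end{aligned}$$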


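Therefore, for every $\lambda \in [0,1]$ we obtain

$$\Big(1 + \frac{\beta}{\zeta} + \frac{\beta^2}{2\zeta}\Big)\xi_{k+1} \le \Big(1 + \frac{\beta}{\zeta} + \frac{\beta^2}{2\zeta} + \frac{1}{n\omega_k \kappa_f}\Big(-\lambda \omega_k \kappa_f + \frac{\lambda^2}{2}(1 + \omega_k)\Big)\Big)\xi_k. \tag{64}$$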
The optimal $\lambda^*$ that minimizes the above expression is

$$\lambda^* = \min\Big\{1, \frac{\omega_k \kappa_f}{\omega_k + 1}\Big\}.$$


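Consider now two cases:

• $\lambda^* < 1$. In this case

$$-\lambda^* \omega_k \kappa_f + \frac{(\lambda^*)^2}{2}(1 + \omega_k) = -\frac{1}{2}\,\frac{(\omega_k \kappa_f)^2}{\omega_k + 1}.$$

Combining this with (64) gives

$$\Big(1 + \frac{\beta}{\zeta} + \frac{\beta^2}{2\zeta}\Big)\xi_{k+1} \le \Big(1 + \frac{\beta}{\zeta} + \frac{\beta^2}{2\zeta} - \frac{1}{2n}\,\frac{\omega_k \kappa_f}{\omega_k + 1}\Big)\xi_k,$$

which is equivalent to

$$\xi_{k+1} \le \Big(1 - \frac{1}{2n}\,\frac{\omega_k \kappa_f}{\omega_k + 1}\,\frac{2\zeta}{2\zeta + 2\beta + \beta^2}\Big)\xi_k.$$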

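• $\lambda^* = 1$. In this case $\frac{\omega_k \kappa_f}{\omega_k + 1} \ge 1$ and hence

$$-\lambda^* \omega_k \kappa_f + \frac{(\lambda^*)^2}{2}(1 + \omega_k) = -\omega_k \kappa_f + \frac{1}{2}(1 + \omega_k) \le -\omega_k \kappa_f + \frac{1}{2}\omega_k \kappa_f = -\frac{1}{2}\omega_k \kappa_f.$$

Therefore, from (64) we can conclude that

$$\xi_{k+1} \le \Big(1 - \frac{\zeta}{n(2\zeta + 2\beta + \beta^2)}\Big)\xi_k.$$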